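Multi-label image recognition is a fundamental yet practical task, because real-world images inherently carry multiple semantic labels. However, collecting large-scale multi-label annotations is difficult owing to the complexity of both the input images and the output label space. To reduce the annotation cost, we propose a structured semantic transfer (SST) framework that enables training multi-label recognition models with partial labels, i.e., only some labels are known while the other labels are missing (also called unknown labels) in each image. The framework consists of two complementary transfer modules that explore intra-image and cross-image semantic correlations to transfer knowledge from known labels and generate pseudo labels for unknown labels. Specifically, an intra-image semantic transfer module learns an image-specific label co-occurrence matrix and maps the known labels to complement the unknown labels based on this matrix. Meanwhile, a cross-image transfer module learns category-specific feature similarities and helps complement unknown labels that have high similarities. Finally, both the known and the generated labels are used to train the multi-label recognition model. Extensive experiments on the Microsoft COCO, Visual Genome, and Pascal VOC datasets show that the proposed SST framework achieves superior performance over current state-of-the-art algorithms. Code is available at https://github.com/hcplab-sysu/sst-ml-pl.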
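As a rough illustration of the intra-image transfer idea, the sketch below fills unknown labels from a label co-occurrence matrix. It is a toy under stated assumptions, not the SST implementation: the matrix is hand-written rather than learned per image, and the function name, the label encoding (1 known positive, -1 known negative, 0 unknown), and the threshold are all hypothetical.

```python
def complement_unknown_labels(labels, cooccurrence, threshold=0.8):
    """Fill unknown entries (0) with pseudo positives (1) using co-occurrence.

    labels: list of 1 (known positive), -1 (known negative), 0 (unknown).
    cooccurrence[i][j]: assumed probability that label j is present given label i.
    """
    completed = list(labels)
    for j, state in enumerate(labels):
        if state != 0:
            continue  # known labels are never overwritten
        # Strongest evidence for label j among the known positive labels.
        score = max(
            (cooccurrence[i][j] for i, s in enumerate(labels) if s == 1),
            default=0.0,
        )
        if score >= threshold:
            completed[j] = 1
    return completed


# Toy example with three hypothetical categories: person, surfboard, oven.
cooc = [
    [1.0, 0.9, 0.1],
    [0.95, 1.0, 0.0],
    [0.3, 0.0, 1.0],
]
labels = [0, 1, 0]  # only "surfboard" is annotated as present
pseudo = complement_unknown_labels(labels, cooc)
print(pseudo)
```

Here "person" receives a pseudo positive because it co-occurs strongly with the annotated "surfboard", while "oven" stays unknown.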
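To address the problem of data inconsistency among different facial expression recognition (FER) datasets, many cross-domain FER methods (CD-FERs) have been devised in recent years. Although each claims superior performance, fair comparisons are lacking because of inconsistent choices of source/target datasets and feature extractors. In this work, we first analyze the performance effect caused by these inconsistent choices, and then re-implement several well-performing CD-FER methods and recently published domain adaptation algorithms. We ensure that all these algorithms adopt the same source datasets and feature extractors so that the CD-FER evaluation is fair. We find that most of the current leading algorithms use adversarial learning to learn holistic domain-invariant features to mitigate domain shift. However, these algorithms ignore local features, which are more transferable across different datasets and carry more detailed content for fine-grained adaptation. To address these issues, we integrate graph representation propagation with adversarial learning for cross-domain holistic-local feature co-adaptation by developing a novel adversarial graph representation adaptation (AGRA) framework. Specifically, it first builds two graphs to correlate holistic and local regions within each domain and across different domains, respectively. Then, it extracts holistic-local features from the input image and uses learnable per-class statistical distributions to initialize the corresponding graph nodes. Finally, two stacked graph convolutional networks (GCNs) are adopted to propagate holistic-local features within each domain to explore their interaction, and across different domains for holistic-local feature co-adaptation. We conduct extensive and fair evaluations on several popular benchmarks and show that the proposed AGRA framework outperforms previous state-of-the-art methods.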
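Because data scarcity and data heterogeneity are prevalent in medical imaging, well-trained convolutional neural networks (CNNs) that use previous normalization methods may perform poorly when deployed to a new site. However, a reliable model for real-world applications should generalize well on both in-distribution (IND) and out-of-distribution (OOD) data (e.g., data from a new site). In this study, we present a novel normalization technique called window normalization (WIN), a simple yet effective alternative to existing normalization methods. Specifically, WIN perturbs the normalizing statistics with local statistics computed on a window of features. This feature-level augmentation technique regularizes the models well and significantly improves their OOD generalization. Taking advantage of it, we propose a novel self-distillation method called WIN-WIN to further improve OOD generalization in classification. WIN-WIN is easily implemented with two forward passes and a consistency constraint, which makes it a simple extension of existing methods. Extensive experimental results on various tasks (such as glaucoma detection, breast cancer detection, chromosome classification, and optic disc and cup segmentation) and datasets (26 datasets) demonstrate the generality and effectiveness of our methods. The code is available at https://github.com/joe1chief/windownormalizaion.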
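The following minimal sketch shows one simple reading of normalizing with window statistics: a 2D feature map is normalized with the mean and variance of a chosen window instead of the whole map. It is an assumption-laden toy, not the released WIN code. The function name and arguments are hypothetical, the window is passed explicitly here (in training it would presumably be sampled at random), and whether the method mixes local and global statistics is not stated in the abstract.

```python
import math


def window_normalize(feature, top, left, height, width, eps=1e-5):
    """Normalize a 2D feature map with the mean and variance of one window."""
    window = [
        feature[r][c]
        for r in range(top, top + height)
        for c in range(left, left + width)
    ]
    mean = sum(window) / len(window)
    var = sum((v - mean) ** 2 for v in window) / len(window)
    scale = math.sqrt(var + eps)
    # The window statistics are applied to every position of the map.
    return [[(v - mean) / scale for v in row] for row in feature]


feature = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
]
full = window_normalize(feature, 0, 0, 3, 3)   # window covers the whole map
local = window_normalize(feature, 0, 0, 2, 2)  # statistics from the top-left 2x2 window
```

When the window covers the whole map the result is ordinary per-map standardization; a smaller window shifts and rescales the output, which is the perturbation that acts as feature-level augmentation.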
A recent study has shown a phenomenon called neural collapse in that the within-class means of features and the classifier weight vectors converge to the vertices of a simplex equiangular tight frame at the terminal phase of training for classification. In this paper, we explore the corresponding structures of the last-layer feature centers and classifiers in semantic segmentation. Based on our empirical and theoretical analysis, we point out that semantic segmentation naturally brings contextual correlation and imbalanced distribution among classes, which breaks the equiangular and maximally separated structure of neural collapse for both feature centers and classifiers. However, such a symmetric structure is beneficial to discrimination for the minor classes. To preserve these advantages, we introduce a regularizer on feature centers to encourage the network to learn features closer to the appealing structure in imbalanced semantic segmentation. Experimental results show that our method can bring significant improvements on both 2D and 3D semantic segmentation benchmarks. Moreover, our method ranks 1st and sets a new record (+6.8% mIoU) on the ScanNet200 test leaderboard. Code will be available at https://github.com/dvlab-research/Imbalanced-Learning.
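To make the target structure concrete, the sketch below builds the vertices of a simplex equiangular tight frame for K classes using the standard construction sqrt(K/(K-1)) (I - (1/K) 1 1^T) and checks that all pairs are equally separated. The function names are illustrative only; this shows the geometry of neural collapse, not the regularizer proposed in the paper.

```python
import math


def simplex_etf(num_classes):
    """Return K unit vectors in R^K forming a simplex equiangular tight frame."""
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


vertices = simplex_etf(4)
pairwise = [
    cosine(vertices[i], vertices[j])
    for i in range(4)
    for j in range(i + 1, 4)
]
```

Every pairwise cosine equals -1/(K-1), which is the largest possible equal separation between K vectors; contextual correlation and class imbalance in segmentation break exactly this symmetry.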
Weakly-supervised object localization aims to indicate the category as well as the scope of an object in an image given only the image-level labels. Most of the existing works are based on Class Activation Mapping (CAM) and endeavor to enlarge the discriminative area inside the activation map to perceive the whole object, yet ignore the co-occurrence confounder of the object and context (e.g., fish and water), which makes it hard for the model to distinguish object boundaries. Besides, the use of CAM also creates a dilemma: classification and localization always suffer from a performance gap and cannot reach their highest accuracy simultaneously. In this paper, we propose a causal knowledge distillation method, dubbed KD-CI-CAM, to address these two under-explored issues in one go. More specifically, we tackle the co-occurrence context confounder problem via causal intervention (CI), which explores the causalities among image features, contexts, and categories to eliminate the biased object-context entanglement in the class activation maps. Based on the de-biased object feature, we additionally propose a multi-teacher causal distillation framework to balance the absorption of classification knowledge and localization knowledge during model training. Extensive experiments on several benchmarks demonstrate the effectiveness of KD-CI-CAM in learning clear object boundaries from confounding contexts and addressing the dilemma between classification and localization performance.
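For readers unfamiliar with the baseline, the sketch below computes a plain class activation map as the classifier-weighted sum of last-layer feature maps. It illustrates standard CAM only, not KD-CI-CAM; the function name and the toy feature maps are assumptions made for the example.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of last-layer feature maps for one target class.

    feature_maps: list of K maps, each an H x W list of lists.
    class_weights: the K classifier weights of the target class.
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for r in range(height):
            for c in range(width):
                cam[r][c] += weight * fmap[r][c]
    return cam


# Two toy 2x2 feature maps: one fires top-left, the other bottom-right.
feature_maps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
cam = class_activation_map(feature_maps, [0.5, 1.0])
```

Because the map is driven entirely by classifier weights, any context channel that correlates with the class (water for fish) contributes to the activation, which is the entanglement the abstract refers to.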
Witnessing the impressive achievements of pre-training techniques on large-scale data in the fields of computer vision and natural language processing, we wonder whether this idea could be adapted in a grab-and-go spirit to mitigate the sample inefficiency problem in visuomotor driving. Given the highly dynamic and variant nature of the input, the visuomotor driving task inherently lacks view and translation invariance, and the visual input contains massive information that is irrelevant to decision making, which makes predominant pre-training approaches from general vision less suitable for the autonomous driving task. To this end, we propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for policy pre-training in visuomotor driving. We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos. The proposed PPGeo is performed in two stages to support effective self-supervised training. In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input. In the second stage, the visual encoder learns driving policy representation by predicting the future ego-motion and optimizing with the photometric error based on the current visual observation only. As such, the pre-trained visual encoder is equipped with rich driving-policy-related representations and is thereby competent for multiple visuomotor driving tasks. Extensive experiments covering a wide span of challenging scenarios have demonstrated the superiority of our proposed approach, where improvements range from 2% to even over 100% with very limited data. Code and models will be available at https://github.com/OpenDriveLab/PPGeo.
In this work, we focus on instance-level open vocabulary segmentation, intending to expand a segmenter for instance-wise novel categories without mask annotations. We investigate a simple yet effective framework with the help of image captions, focusing on exploiting thousands of object nouns in captions to discover instances of novel classes. Rather than adopting pretrained caption models or using massive caption datasets with complex pipelines, we propose an end-to-end solution from two aspects: caption grounding and caption generation. In particular, we devise a joint Caption Grounding and Generation (CGG) framework based on a Mask Transformer baseline. The framework has a novel grounding loss that performs explicit and implicit multi-modal feature alignments. We further design a lightweight caption generation head to allow for additional caption supervision. We find that grounding and generation complement each other, significantly enhancing the segmentation performance for novel categories. We conduct extensive experiments on the COCO dataset with two settings: Open Vocabulary Instance Segmentation (OVIS) and Open Set Panoptic Segmentation (OSPS). The results demonstrate the superiority of our CGG framework over previous OVIS methods, achieving a large improvement of 6.8% mAP on novel classes without extra caption data. Our method also achieves over 15% PQ improvements for novel classes on the OSPS benchmark under various settings.
Nearest-Neighbor (NN) classification has been proven to be a simple and effective approach for few-shot learning. The query data can be classified efficiently by finding the nearest support class based on features extracted by pretrained deep models. However, NN-based methods are sensitive to the data distribution and may produce false predictions if the samples in the support set happen to lie around the distribution boundary of different classes. To solve this issue, we present P3DC-Shot, an improved nearest-neighbor based few-shot classification method empowered by prior-driven data calibration. Inspired by the distribution calibration technique, which utilizes the distribution or statistics of the base classes to calibrate the data for few-shot tasks, we propose a novel discrete data calibration operation that is more suitable for NN-based few-shot classification. Specifically, we treat the prototypes representing each base class as priors and calibrate each support sample based on its similarity to the different base prototypes. Then, we perform NN classification using these discretely calibrated support data. Results from extensive experiments on various datasets show that our efficient non-learning based method can outperform, or is at least comparable to, SOTA methods that need additional learning steps.
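The sketch below illustrates the general idea of calibrating a support feature toward base-class prototypes before nearest-neighbor classification. It is a hypothetical toy, not the P3DC-Shot procedure: the softmax weighting over cosine similarities, the mixing factor `alpha`, the `temperature`, and all function names are assumptions introduced for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def calibrate(support, base_prototypes, alpha=0.5, temperature=0.1):
    """Pull a support feature toward a similarity-weighted mix of base prototypes."""
    sims = [cosine(support, p) for p in base_prototypes]
    exps = [math.exp(s / temperature) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    prior = [
        sum(w * p[d] for w, p in zip(weights, base_prototypes))
        for d in range(len(support))
    ]
    return [(1 - alpha) * s + alpha * q for s, q in zip(support, prior)]


def nearest_class(query, supports):
    """Return the label whose (calibrated) support feature is closest to the query."""
    return min(supports, key=lambda label: math.dist(query, supports[label]))


base_prototypes = [[1.0, 0.0], [0.0, 1.0]]
raw_supports = {"cat": [0.9, 0.4], "dog": [0.4, 0.9]}
calibrated = {k: calibrate(v, base_prototypes) for k, v in raw_supports.items()}
prediction = nearest_class([1.0, 0.1], calibrated)
```

No parameters are trained here: calibration and classification are both closed-form, which mirrors the non-learning character the abstract emphasizes.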
In this tutorial paper, we look into the evolution and prospect of network architecture and propose a novel conceptual architecture for the 6th generation (6G) networks. The proposed architecture has two key elements, i.e., holistic network virtualization and pervasive artificial intelligence (AI). The holistic network virtualization consists of network slicing and digital twin, from the aspects of service provision and service demand, respectively, to incorporate service-centric and user-centric networking. The pervasive network intelligence integrates AI into future networks from the perspectives of networking for AI and AI for networking, respectively. Building on holistic network virtualization and pervasive network intelligence, the proposed architecture can facilitate three types of interplay, i.e., the interplay between digital twin and network slicing paradigms, between model-driven and data-driven methods for network management, and between virtualization and AI, to maximize the flexibility, scalability, adaptivity, and intelligence for 6G networks. We also identify challenges and open issues related to the proposed architecture. By providing our vision, we aim to inspire further discussions and developments on the potential architecture of 6G.
In this paper, we investigate the joint device activity and data detection in massive machine-type communications (mMTC) with a one-phase non-coherent scheme, where data bits are embedded in the pilot sequences and the base station simultaneously detects active devices and their embedded data bits without explicit channel estimation. Due to the correlated sparsity pattern introduced by the non-coherent transmission scheme, the traditional approximate message passing (AMP) algorithm cannot achieve satisfactory performance. Therefore, we propose a deep learning (DL) modified AMP network (DL-mAMPnet) that enhances the detection performance by effectively exploiting the pilot activity correlation. The DL-mAMPnet is constructed by unfolding the AMP algorithm into a feedforward neural network, which combines the principled mathematical model of the AMP algorithm with the powerful learning capability, thereby benefiting from the advantages of both techniques. Trainable parameters are introduced in the DL-mAMPnet to approximate the correlated sparsity pattern and the large-scale fading coefficient. Moreover, a refinement module is designed to further advance the performance by utilizing the spatial feature caused by the correlated sparsity pattern. Simulation results demonstrate that the proposed DL-mAMPnet can significantly outperform traditional algorithms in terms of the symbol error rate performance.
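For orientation, the generic real-valued AMP iteration from compressed sensing is written out below for a model $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{w}$ with $\mathbf{A} \in \mathbb{R}^{M \times N}$ and approximately unit-norm columns. This is the textbook form, not the modified iteration of the DL-mAMPnet; the symbols $\eta_t$ (a component-wise denoiser such as soft thresholding), $\mathbf{r}^t$, and $\mathbf{z}^t$ are notation assumed here, and $\langle\cdot\rangle$ denotes the average over the $N$ entries.

```latex
\begin{aligned}
\mathbf{r}^{t}   &= \mathbf{x}^{t} + \mathbf{A}^{\mathsf{T}} \mathbf{z}^{t}, \\
\mathbf{x}^{t+1} &= \eta_t\!\left(\mathbf{r}^{t}\right), \\
\mathbf{z}^{t+1} &= \mathbf{y} - \mathbf{A}\mathbf{x}^{t+1}
  + \frac{N}{M}\,\mathbf{z}^{t}\,\big\langle \eta_t'\!\left(\mathbf{r}^{t}\right) \big\rangle .
\end{aligned}
```

Unfolding treats each iteration $t$ as one network layer, so quantities inside $\eta_t$ can be replaced by trainable parameters, which is where a learned sparsity prior would enter.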